A Study Using n-gram Features for Text Categorization

Author

  • Johannes Fürnkranz
Abstract

In this paper, we study the effect of using n-grams (sequences of words of length n) for text categorization. We use an efficient algorithm for generating such n-gram features in two benchmark domains, the 20 newsgroups data set and 21,578 REUTERS newswire articles. Our results with the rule learning algorithm RIPPER indicate that, after the removal of stop words, word sequences of length 2 or 3 are most useful. Using longer sequences reduces classification performance.
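To make the feature definition concrete, the following is a minimal sketch of word n-gram extraction after stop-word removal. It is a naive enumeration for illustration, not the efficient generation algorithm the paper refers to. The function name `word_ngrams`, the tiny `STOP_WORDS` list, and whitespace tokenization are all assumptions introduced here.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "the", "to"})


def word_ngrams(text, max_n=3, stop_words=STOP_WORDS):
    """Count all word n-grams of length 1..max_n after stop-word removal.

    Tokenization is a plain lowercase whitespace split, so punctuation
    stays attached to words; a real pipeline would tokenize more carefully.
    """
    tokens = [t for t in text.lower().split() if t not in stop_words]
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens over the filtered token sequence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Unigrams and bigrams of a short example sentence.
features = word_ngrams("the removal of stop words in the text", max_n=2)
```

Because stop words are dropped before the window slides, n-grams in this sketch can span words that were not adjacent in the original text (here, "words text"). Setting `max_n` to 2 or 3 corresponds to the sequence lengths the abstract reports as most useful.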


Similar resources

A Study Using n-gram Features for Text Categorization

In this paper, we study the effect of using n-grams (sequences of words of length n) for text categorization. We use an efficient algorithm for generating such n-gram features in two benchmark domains, the 20 newsgroups data set and 21,578 REUTERS newswire articles. Our results with the rule learning algorithm RIPPER indicate that, after the removal of stop words, word sequences of length 2 or...


Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA

With the explosive growth in the amount of information, tools and methods are needed to search, filter, and manage resources. One of the major problems in text classification is the high dimensionality of the feature space. A main goal in text classification is therefore to reduce the dimensionality of the feature space. There are many feature selection methods. However...


Image Categorization Using Texture Features

A method for finding all images from the same category as a given query image (termed categorization) using texture features is presented. The hypothesis that two images that are similar in texture are likely to belong to the same category is examined. A new texture feature termed N×M-gram is presented. It is based on the N-gram technique that is commonly used for text similarity. The process...


Employing Relation between Reading and Writing Skills on Age Based Categorization of Short Estonian Texts

In this paper, we present the results of our study on age-based categorization of short texts as 85 words per author. We introduce a novel set of features that work reliably with short texts and are easy to extract from the text itself without any outside databases. These features were formerly known as variables in readability formulas. We tested datasets presented two age groups children and...


Efficient Text Categorization

We present an approach to text categorization using machine learning techniques. The approach is developed and tested on a large text hierarchy named Yahoo that is available on the Web. We handle the large number of features and training examples by taking into account the hierarchical structure of examples and using feature subset selection for large text data. The large number of categories is han...




Publication date: 1998